Introduction

The purpose of the project was to determine if Rasch Model
procedures have any utility for equating pre-existing tests.
Specifically, we reanalyzed the data from the equating phase of the
Anchor Test Study.

The purpose of this management report is to summarize the work
completed in the project, to describe the differences between our work
and that of Educational Testing Service in the Anchor Test Project,
and to present our recommendations and conclusions to the U.S.
Office of Education.

Summary of Rasch Project

Data Organization

Tests involved included seven reading test batteries, each
having from one to three levels and two forms, and each having a
vocabulary and comprehension subtest. There were 28 form-level
combinations possible. Therefore, we were concerned with the
simultaneous equating of 28 tests for each of vocabulary, comprehension,
and total scores.

We equated without regard to grade level of subjects, i.e., data
from children who took a particular test were not subdivided by
grade level when these children were members of more than one grade.

Theoretical Orientation

We reviewed both Rasch theory and equating procedures literature.
The following general principles evolved from this review.

1. In situations in which two tests reasonably conform to Rasch
Model conditions and these two tests are administered to a single group
of subjects, then equating simplifies to the determination of a single
additive constant.

2. The stability of equating depends entirely on the stability
of the raw score to ability score calibration; therefore, the observation
of reasonably stable calibration implies equating stability.

3. Equated raw scores can be defined as scores corresponding to
the same ability level, a definition that is analogous to equipercentile
and linear model definitions.
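
The three principles can be illustrated with a minimal sketch. It assumes the standard dichotomous Rasch model, in which the probability of a correct response is 1 / (1 + exp(-(ability - difficulty))), and it assumes the item difficulties of both tests are already expressed on one common log scale. The function names and the five-item tests are hypothetical and are not taken from the project.

```python
import math

def expected_score(ability, difficulties):
    """Expected raw score at a given ability under the dichotomous Rasch model."""
    return sum(1.0 / (1.0 + math.exp(-(ability - d))) for d in difficulties)

def ability_for_raw_score(raw, difficulties, tol=1e-9):
    """Ability at which the expected raw score equals `raw`, found by bisection."""
    if not 0 < raw < len(difficulties):
        raise ValueError("finite estimates exist only for 0 < raw < number of items")
    mean_d = sum(difficulties) / len(difficulties)
    lo, hi = mean_d - 20.0, mean_d + 20.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def equating_constant(diffs_a, diffs_b):
    """Mean difficulty of test A minus that of test B, both on the common scale.

    Adding it to an ability on A's own zero-centered scale expresses that
    ability on B's own zero-centered scale.
    """
    return sum(diffs_a) / len(diffs_a) - sum(diffs_b) / len(diffs_b)

def equate_raw_score(raw_a, diffs_a, diffs_b):
    """Raw score on test B whose ability is nearest to that of raw_a on test A."""
    target = ability_for_raw_score(raw_a, diffs_a)
    return min(range(1, len(diffs_b)),
               key=lambda r: abs(ability_for_raw_score(r, diffs_b) - target))

# Hypothetical five-item tests; test B is one log unit harder than test A.
test_a = [-1.0, -0.5, 0.0, 0.5, 1.0]
test_b = [0.0, 0.5, 1.0, 1.5, 2.0]
print(equating_constant(test_a, test_b))    # -1.0
print(equate_raw_score(3, test_a, test_b))  # 2
```

In this sketch a raw score of 3 on the easier test corresponds to a raw score of 2 on the harder one, because those two scores lie closest together on the ability scale.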

Equating Errors

We developed equations for estimating standard errors of equating
constants and also estimated these values directly from studies of
calibration stability. We conclude that the major source of error is
the usual measurement error. The error in the equating constants is
trivial. There is an error involved in assigning raw scores to
equivalent raw scores that can be avoided by using reference scales
instead of raw score equating. The fourth possible error source is
due to calibration instability, which can be studied prior to undertaking
equating studies.

Data-Model Conformity

Problems of assessing "model-data fit" were discussed. The most
reasonable recommendation is that "fit" be determined by the degree
to which specific objectivity is observed. That is, if various
samples yield similar results, then invariance with regard to samples
is present, which, in turn, implies adequate fit for equating purposes.
Such studies were conducted by comparing multiple random samples
of size 500, 1000, 2000, and 4000; by comparing results of calibrations
over all occurrences of a test in the sampling design; by studying the
differences in calibrations for racial and intelligence level subsamples
for all tests; and by additionally studying STEP calibrations for
subsamples divided by sex, grade, school system size, and school
percentage of welfare families.

Our general conclusion from these studies was that results were
adequately stable to support the use of the Rasch Model in equating
these tests.
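
The specific stability statistics used in the project are not reproduced in this summary. As one hedged illustration of how invariance across samples might be examined, the sketch below compares two calibrations of the same items by centering each set of estimates and dividing each item's difference by its combined standard error. The function names, the example values, and the flagging threshold of 2.0 are assumptions made for illustration only.

```python
import math

def center(values):
    """Shift estimates so their mean is zero; only relative positions are compared."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def standardized_differences(est_1, se_1, est_2, se_2):
    """Standardized difference per item between two subsample calibrations."""
    c1, c2 = center(est_1), center(est_2)
    return [
        (a - b) / math.sqrt(sa ** 2 + sb ** 2)
        for a, sa, b, sb in zip(c1, se_1, c2, se_2)
    ]

def flag_unstable_items(z_values, threshold=2.0):
    """Indexes of items whose standardized difference exceeds the threshold."""
    return [i for i, z in enumerate(z_values) if abs(z) > threshold]

# Hypothetical item estimates and standard errors from two subsamples.
sample_1 = [-1.0, 0.0, 1.0]
sample_2 = [-0.9, 0.1, 1.7]
errors = [0.1, 0.1, 0.1]
z = standardized_differences(sample_1, errors, sample_2, errors)
print(flag_unstable_items(z))  # [2]
```

If few or no items are flagged across many pairs of subsamples, the calibration can be regarded as stable in the sense described above.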

Methodology of Multiple Test Equating

Our procedures for developing equating constants and their
standard errors were presented in detail. The specific methodology
is easily modified for other possible sampling designs.

Methodology was also presented for using reference scales and
for using user-developed new tests composed of any items on any of
the tests included in our analyses.

Equating Tables

We present equating tables for both vocabulary and comprehension
that allow a user to determine for a particular primary test form
an equated score corresponding to any raw score obtained on any of
the other 13 primary forms or on the appropriate secondary (parallel)
form.

Due to the importance of assignment error, we present all
possible assignment errors. Moreover, we attempt to solve the
problem of assignment error by recommending and providing a reference
scale (our National Reference Scale) for interpreting all obtained
test scores on a common scale.

Test Calibration Data

For each of the 28 tests, separately for vocabulary, comprehension,
and total scores, we present for each possible raw score the percentage
of children earning that score, the score's corresponding ability
estimate (unadjusted), the standard error of measurement for that
ability, our National Reference Scale score, and the NRS score standard
error. Also presented is the total test Kuder-Richardson formula 20
reliability estimate and the test's equating constant.
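
For readers unfamiliar with the reliability index named above, the following is a minimal sketch of Kuder-Richardson formula 20, computed as (k / (k - 1)) * (1 - sum of p*q over items / variance of total scores). It assumes dichotomous (0/1) item scores and uses the population variance of total scores; the function name and the tiny response matrix are hypothetical.

```python
def kr20(responses):
    """KR-20 for a list of 0/1 response vectors, one vector per examinee."""
    n_people = len(responses)
    k = len(responses[0])
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n_people
    # Population variance of total scores (divide by N, an assumption of this sketch).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    pq_sum = 0.0
    for i in range(k):
        p = sum(person[i] for person in responses) / n_people  # proportion correct
        pq_sum += p * (1.0 - p)
    return (k / (k - 1)) * (1.0 - pq_sum / var_total)

# Hypothetical responses of four examinees to three items.
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(kr20(data))  # 0.75
```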

Item Analysis Data

For each item of each of the 28 tests, separately for vocabulary,
comprehension, and total scores, we present the following item data:
difficulty (percentage correct), log easiness estimate, the corresponding
standard error of the easiness estimate, the point-biserial of the item
with ability estimates, the item characteristic curve slope, and an
item mean square fit index.

We present for each of the 28 tests, for each of vocabulary,
comprehension, and total scores, summary data on all of the various
item indexes. The summaries include frequency distributions, means,
standard deviations, skewness indexes, kurtosis indexes, medians,
and semi-interquartile ranges.

In addition, the relationship of these difficulties, easinesses,
point-biserials, and slopes to the item mean square fit indexes is
displayed graphically for all tests.
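
As an illustration of one of the item indexes listed above, the sketch below computes a point-biserial correlation between a dichotomous item score and examinee ability estimates, using the usual formula (mean of those passing minus mean of those failing, divided by the standard deviation of all abilities, times the square root of p*q). The function name and the four-examinee data are hypothetical, and the population standard deviation is an assumption of this sketch.

```python
import math

def point_biserial(item_scores, abilities):
    """Point-biserial correlation of a 0/1 item with ability estimates."""
    n = len(item_scores)
    passed = [a for s, a in zip(item_scores, abilities) if s == 1]
    failed = [a for s, a in zip(item_scores, abilities) if s == 0]
    if not passed or not failed:
        raise ValueError("item must have both correct and incorrect responses")
    mean_all = sum(abilities) / n
    sd_all = math.sqrt(sum((a - mean_all) ** 2 for a in abilities) / n)
    p = len(passed) / n
    mean_pass = sum(passed) / len(passed)
    mean_fail = sum(failed) / len(failed)
    return (mean_pass - mean_fail) / sd_all * math.sqrt(p * (1.0 - p))

# Hypothetical item responses and ability estimates for four examinees.
r = point_biserial([1, 1, 0, 0], [2.0, 1.0, 0.0, -1.0])
print(round(r, 4))  # 0.8944
```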



Rasch Project - Anchor Test Project Differences

The major objective of our project was to re-equate Anchor Test
Project data using techniques of Rasch Theory. Thus, the obvious
difference in our work is that we used Rasch Model test calibrations
and equating methods while the Anchor Test Project used a variety
of equipercentile and linear model methods. However, there were
other important methodological and output differences which will
be outlined here.

Equating Raw Scores

Results from the Anchor Test Project were based on data divided
by grade level. Thus, they developed equating on somewhat different
data than we used, as we kept together all data on a specific test.
They present equating tables separately for each grade and include
in each table only the seven tests considered by test publishers as
appropriate for that grade. Our tables allow a user to administer
out-of-grade-level tests and equate the obtained score to an appropriate
in-level test.

Moreover, the Anchor Test Project did not provide tables for
equating primary to secondary forms. Our tables allow for the
conversion of secondary forms into primary forms.

Reference Scales

The Anchor Test Project tables provide no way to avoid assignment
errors or to equate scores across test levels. Our National Reference
Scale solves both of these problems. With it, any of the 28 tests
can be given to any child and the resulting score can be interpreted
free of assignment error.

Specifically, the NRS allows considerable opportunity to evaluate
reading programs that involve growth over reading levels. Thus, data
over a several year period can be evaluated on a common metric, allowing
the opportunity for growth to be revealed. Such scaling is a necessary
prerequisite to assessing growth without resorting to grade equivalent
scores and their accompanying technical weaknesses.

It would be extremely valuable to obtain data for extending the
NRS downward to reading readiness levels and upward to junior high
school levels. Moreover, the freedom to use any of 28 tests as
essentially parallel forms of each other can be quite valuable in the
evaluation of programs requiring periodic assessment.

Comparisons of Tables

Direct comparisons of equating tables of the two projects were
presented for selected test pairs using subsamples of subjects who
were administered both tests in each test pair in the same order of
testing. For each raw score on the base test three equated scores
were obtained: the recommended Anchor Test Project value, the
recommended Rasch Project value, and a subgroup conditional mean.
The reader can scan the tables to determine how similar the various
results are. Several such tables are presented which differ in
regard to model-data fit and grade level. In general, tables are
quite similar. Often both projects yield the same equated value,
many values differ by one or two points, and only rarely are values
different by more than three points. These differences are small
relative to standard errors of measurement.

A discussion of the difficulty of comparing results is also
presented. There is no legitimate way to say which is "best".
The definition of "best" will be largely dependent upon the
theoretical orientation of the reader. At least, with a strong
model such as the Rasch model, one can gather information on
whether or not the results should be used. The equipercentile
method does not lend itself to such tests.

Conclusions and Recommendations

The following sections of this report contain some conclusions
and recommendations relative to any future equating effort that
might be undertaken. The topics discussed include technical issues
as well as design and cost considerations.

Raw Score Equating

There are three ways to achieve comparability between the scores
on two or more tests. The first is to construct parallel forms, a
process that is quite rigorous, resulting in isomorphic test score
scales which by definition are equal. This procedure can only be
accomplished at the test construction level and is mentioned here
only to complete the context for the discussion that follows. The
other two methods are the ones we have been concerned with in this
study. They are raw score-to-raw score equating and raw score-to-
reference scale equating; we will call them raw score equating and
reference scale equating, respectively.

In reference scale equating, each test to be equated has its
score scale translated into a single reference scale whose units
may or may not be similar to its raw score scale. Two examples of
these are the CEEB's Scholastic Aptitude Test scale of 200 to 800
and our own National Reference Scale for reading (see Volume I,
Final Report) with an effective score range of 144 to 263 (for the
tests included here). In equipercentile equating the scale of
percentile ranks is the reference scale, linear equating uses a
z-score scale, and Rasch equating is based on the log ability scale.
Regardless, then, of the specific method of equating, each procedure
has at least an implied reference scale, and these reference scales
have their own unique properties. For example, the percentile
rank scale is a description of the performance of the calibrating
sample and, as such, would differ from sample to sample. On the
other hand, the Rasch ability scale does not depend on the calibrating
sample; its values are invariant with respect to calibration by
differing samples. Thus, regardless of specific method, tests to
be equated are in fact translated into a particular reference scale
whether that is the end product or not.
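
The sample dependence of the percentile rank and z-score scales can be seen in a small sketch. The two samples below are hypothetical, the function names are ours, and the percentile rank uses the common mid-rank convention (half of the tied scores counted as below), which is an assumption rather than the convention of either project.

```python
import math

def percentile_rank(score, sample):
    """Percentage of the sample below the score, counting half of the ties."""
    below = sum(1 for s in sample if s < score)
    ties = sum(1 for s in sample if s == score)
    return 100.0 * (below + 0.5 * ties) / len(sample)

def z_score(score, sample):
    """Score expressed in standard deviation units of the calibrating sample."""
    mean = sum(sample) / len(sample)
    sd = math.sqrt(sum((s - mean) ** 2 for s in sample) / len(sample))
    return (score - mean) / sd

# The same raw score of 16 evaluated against two hypothetical calibrating samples.
lower_sample = [10, 12, 14, 16, 18]
higher_sample = [14, 16, 18, 20, 22]
print(percentile_rank(16, lower_sample), percentile_rank(16, higher_sample))  # 70.0 30.0
print(round(z_score(16, lower_sample), 3), round(z_score(16, higher_sample), 3))  # 0.707 -0.707
```

The same raw score lands at different points on both scales depending on who was in the calibrating sample, which is the property contrasted above with the Rasch ability scale.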

An additional step is taken with raw score equating. What
happens is that two raw scores are assigned to be equal when they
correspond to the same reference scale score. Furthermore, when a
raw score on one test must be assigned an equivalent
score on another test even when the difference between reference
scale scores is large, the result is what we called "assignment
error". The magnitude of this error exceeds all other equating
errors by a significant amount, as we have illustrated in our final
report (Section 5.1, particularly Table 5.1-1). This factor alone
argues against raw score equating and, indeed, that is our recommendation.
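
A minimal sketch of how assignment error arises is given below. It assumes each test has a table mapping raw scores to reference scale scores and that the equated raw score is the one with the nearest reference scale score; the tables, the function name, and the nearest-score rule are illustrative assumptions, not values or procedures from the final report.

```python
def assignment_errors(scale_a, scale_b):
    """Map each raw score on test A to the nearest raw score on test B.

    Returns {raw_a: (raw_b, leftover)} where leftover is the reference scale
    score of raw_b minus that of raw_a, i.e. the assignment error.
    """
    results = {}
    for raw_a, ref_a in scale_a.items():
        raw_b = min(scale_b, key=lambda r: abs(scale_b[r] - ref_a))
        results[raw_a] = (raw_b, scale_b[raw_b] - ref_a)
    return results

# Hypothetical raw score to reference scale tables for two short tests.
scale_a = {1: 150.0, 2: 160.0, 3: 170.0}
scale_b = {1: 154.0, 2: 167.0, 3: 181.0}
print(assignment_errors(scale_a, scale_b))
# {1: (1, 4.0), 2: (1, -6.0), 3: (2, -3.0)}
```

Because raw scores are whole numbers, the leftover differences cannot be removed by any choice of equated raw score; reporting the reference scale score directly avoids them.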

Vertical Equating

Another advantage of reference scale equating is that it permits
the definition of a test scale across several levels of a test battery.
A common scale that spans several grade levels would permit the
measurement and description of growth and change. Equating several
levels of the test battery by means of a common reference scale is
called vertical equating and it is an important capability. Our
National Reference Scale accomplishes this for the tests used and
the grades they cover. We believe that the measurement of growth was
a serious omission in the original conception of the Anchor Test
design and ought to be included in any future equating efforts.

Evaluation of the Equating Process

In spite of our efforts to arrive at something that might be
called a "standard error of equating", or for that matter the efforts
of the Anchor Test Study's authors, a solution remains elusive. It
is seemingly a simple matter to compute "standard errors" based on
replications or to compute some root-mean-squared deviation from
expectation; however, to conclude from that procedure the superiority
of one method over another focuses on only the "consistency" property
of an estimation. It is perhaps in this case more important to focus
on the property of bias, to which neither we nor the Anchor Test Study
directed ourselves. The issue is like making claims for test
reliability without dealing with validity.

Additional research needs to be done to find a satisfactory way
to compare and evaluate equating methods. At the present time, the
fact of the matter is that the Rasch Model procedure and the equipercentile
procedure are not strictly comparable. These two methods, along with
linear equating, are based on different definitions of an equated score.
Perhaps each does a good job of equating under its own definition, but
it is inappropriate to compare methods that attempt to do different
things. At the same time, it still seems to be quite a meaningful
question to ask, "If a person scores 43 on test A, what would his
score be if he had taken test B?" There may indeed be several answers
to that question, or perhaps we need to reformulate the question before
we can get a satisfactory answer. Additional research needs to be
done before these answers will be clear.

Possible Designs and Required Sample Sizes

The size of the sample will probably be the single most important
factor in determining cost of any future equating study; however, the
particular design that might be used is intimately associated with
the sample size question. In the case where it is impossible to
administer all tests to be equated to all examinees it would seem
that some sampling procedure like that used for the Anchor Test Study
would be most feasible. Angoff (1971) discusses several designs
for data collection and Brigman (1976) has specifically compared
three designs similar to the Anchor Test Study design; we will rely
on her work in making some observations about designs.

Brigman (1976) compared the "full matrix" design (where tests are
administered in all possible pairs as was done in the ATS) with two
reduced models called a "chain" design and a "vector" design. These
three designs are illustrated below:



Figure 1. Three Designs for Equating

     Full Matrix               Chain                     Vector
     T1   T2   T3   T4         T1   T2   T3   T4         T1   T2   T3   T4
T1   x/x  x    x    x     T1   x/x  x    o    x     T1   x/x  x    o    o
T2   x    x/x  x    x     T2   x    x/x  x    o     T2   x    x/x  x    x
T3   x    x    x/x  x     T3   o    x    x/x  x     T3   o    x    x/x  o
T4   x    x    x    x/x   T4   x    o    x    x/x   T4   o    x    o    x/x

The situations depicted call for the equating of 4 tests. The
"x/x" means that a particular test is given along with its secondary
form. In all instances the row index test is administered first;
and, since the matrices are symmetrical, it is obvious that whenever
a particular test pair is administered the designs call for 2
replications with order-of-administration balanced.

As pointed out earlier, Rasch Model equating depends entirely
on the estimation of the "equating constant", the translation
factor which when added to the ability scale of one test in a test
pair equalizes the origin of both tests in that pair. The estimation
of this constant requires only an estimate of the difference in
average item difficulty between the two tests and constitutes the
amount of adjustment necessary. Brigman found no essential difference
between equating constants estimated from each design.

The importance of this finding is that any of these designs
could be chosen for purposes other than adequate estimation of the
equating constant. For example, the chain design might be best for
the vertical equating of different levels of a test battery since
adjacent levels could be administered to the same group whereas
nonadjacent levels would be inappropriate (granted of course that
we eliminate the remaining nonadjacent-level cells from the design).
On the other hand the vector design would be appropriate for an equating
study like the Anchor Test Study since one test could be administered
in combination with all others at considerable reduction in the
number of cells for which data were collected, 12 in the case of
the Full Matrix design and 6 for the Vector design (ignoring
diagonals in the above example).
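
The cell counts quoted above can be checked with a short sketch that enumerates the off-diagonal cells of each design as ordered pairs (row test administered first). Treating the chain as wrapping around from the last test to the first, and taking the second test as the common test of the vector design, are our reading of Figure 1 and should be taken as assumptions; the function names are hypothetical.

```python
def full_matrix(n):
    """Every ordered pair of distinct tests."""
    return {(i, j) for i in range(n) for j in range(n) if i != j}

def chain(n):
    """Each test paired with the next one, wrapping around, in both orders."""
    cells = set()
    for i in range(n):
        j = (i + 1) % n
        cells.add((i, j))
        cells.add((j, i))
    return cells

def vector(n, common=1):
    """One common test paired with every other test, in both orders."""
    cells = set()
    for i in range(n):
        if i != common:
            cells.add((common, i))
            cells.add((i, common))
    return cells

def is_symmetric(cells):
    """True when every pair also appears in the reverse order of administration."""
    return all((j, i) in cells for i, j in cells)

for name, design in [("full matrix", full_matrix(4)),
                     ("chain", chain(4)),
                     ("vector", vector(4))]:
    print(name, len(design), is_symmetric(design))
# full matrix 12 True
# chain 8 True
# vector 6 True
```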

In her study Brigman also investigated sample size, using cell
sizes of 125, 250, and 500. Again there was no difference. Our own
work indicates that samples of 500 produce sufficient stability but
that stability did increase up to about 1000 and then began to level
off. Our conclusion about a required sample size is based on a per
cell size of 500 to 1000; fewer than 500 would be inadequate, beyond
1000 would be wasteful; and no one would believe 125.

Estimates of Cost

It is not possible to estimate dollar costs for any future
equating efforts; however, it is possible to identify cost factors
that will determine dollar amounts. Two factors appear to us to be
important: (1) the amount of data that need to be collected and
(2) the extent to which the contractor's data processing capability
has been developed. Any project will have a core of personnel which
should be relatively constant across projects; however, projects may
vary in personnel due to the two factors mentioned above. The same
is true for supplies, materials and operations. We believe that
considerable savings might be realized by funding equating studies
in phases and we would like to deal briefly with one of these.

Equating Prerequisites

We have stressed the point many times that equating with the
Rasch Model is simple and straightforward provided there is an
acceptable degree of model-data fit. Evaluation of model-data
fit ought to be separated from actual equating and, furthermore,
the funding should be separated. Model-data fit is the central
question whenever the Rasch Model is to be applied to existing
tests. Studies of this sort could be made without collecting
additional data if, for example, publishers could be persuaded to
let a contractor use data they already have, an arrangement which
we have found successful in the past. If fit studies prove successful,
then there need be little concern for elaborate and costly sampling
plans for an equating phase.



